AITopics

2605.08352

Country:

North America > United States (1.00)
Europe (1.00)
North America > Canada (0.68)

Genre: Workflow (0.94)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Neural Information Processing SystemsMay-1-2026, 04:57:35 GMT

e444859b2a22df6b56af9381ad1e9480-Paper-Conference.pdf

artificial intelligence, data mining, machine learning, (19 more...)

Country: North America > United States (0.68)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Data Science > Data Mining (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)

arXiv.org Machine LearningMay-1-2026

Decoupled Descent: Exact Test Error Tracking Via Approximate Message Passing

Lovig, Max

In modern parametric model training, full-batch gradient descent (and its variants) suffers due to progressively stronger biasing towards the exact realization of training data; this drives the systematic ``generalization gap'', where the train error becomes an unreliable proxy for test error. Existing approaches either argue this gap is benign through complex analysis or sacrifice data to a validation set. In contrast, we introduce decoupled descent (DD), a novel theory-based training algorithm that satisfies a train-test identity -- enforcing the train error to asymptotically track the test error for stylized Gaussian mixture models. Within this specific regime, leveraging approximate message passing theory, DD iteratively cancels the biases due to data reuse, rigorously demonstrating the feasibility of zero-cost validation and $100\%$ data utilization. Moreover, DD is governed by a low-dimensional state evolution recursion, rendering the dynamics of the algorithm transparent and tractable. We validate DD on XOR classification, yielding superior performance compared to GD; additionally, we implement noisy MNIST and non-linear probing of CIFAR-10, demonstrating that even when our stylized assumptions are relaxed, DD narrows the generalization gap compared to GD.

artificial intelligence, assumption, machine learning, (18 more...)

2604.27883

Country: North America > Canada (0.28)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Neural Information Processing SystemsApr-25-2026, 03:11:21 GMT

Robust Regression Revisited: Acceleration and Improved Estimation Rates

Parameter estimation in generalized linear models, such as linear and logistic regression problems, is among the most fundamental and well-studied statistical optimization problems. It serves as the primary workhorse in statistical studies arising from a variety of disciplines, ranging from economics [Smi12], biology [VGSM05], and the social sciences [Gor10].

algorithm, artificial intelligence, machine learning, (14 more...)

Country:

Europe (0.67)
North America > United States > California (0.46)

Genre:

Research Report > New Finding (0.48)
Research Report > Experimental Study (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.68)

Neural Information Processing SystemsApr-24-2026, 15:17:41 GMT

Derivations of Formulas

We have omitted a number of complicated formulas in the main text to provide clear intuition and concise proof sketch. We will list all mentioned formulas here for readers' reference. We consider the case where U = V = Aand Σ is symmetric and full-rank, and we use gradient flow. We can derive the dynamics of S = AA>as S:= (Σ S)S+ S(Σ S), which is a quadratic ordinary differential equation and it is hard to solve directly. For simplicity, define X:= X Σ 1. Then X = XΣ ΣX. (24) Solving this equation and we have And it is interesting to verify that S(t) + P(t) Σ by using the following lemma.

artificial intelligence, eigenvalue, equation, (13 more...)

Technology: Information Technology > Artificial Intelligence (0.47)

Neural Information Processing SystemsApr-24-2026, 15:17:37 GMT

Global Convergence of Gradient Descent for Asymmetric Low-Rank Matrix Factorization

This is a canonical problem that admits two difficulties in optimization: 1) non-convexity and 2) non-smoothness (due to unbalancedness of U and V). This is also a prototype for more complex problems such as asymmetric matrix sensing and matrix completion. Despite being non-convex and non-smooth, it has been observed empirically that the randomly initialized gradient descent algorithm can solve this problem in polynomial time. Existing theories to explain this phenomenon all require artificial modifications of the algorithm, such as adding noise in each iteration and adding a balancing regularizer to balance the U and V. This paper presents the first proof that shows randomly initialized gradient descent converges to a global minimum of the asymmetric low-rank factorization problem with a polynomial rate. For the proof, we develop 1) a new symmetrization technique to capture the magnitudes of the symmetry and asymmetry, and 2) a quantitative perturbation analysis to approximate matrix derivatives. We believe both are useful for other related non-convex problems.

artificial intelligence, gradient descent, machine learning, (17 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Zhou, Yichen, Tu, Stephen

CLT-Optimal Parameter Error Bounds for Linear System Identification

arXiv.org Machine LearningApr-24-2026

There has been remarkable progress over the past decade in establishing finite-sample, non-asymptotic bounds on recovering unknown system parameters from observed system behavior. Surprisingly, however, we show that the current state-of-the-art bounds do not accurately capture the statistical complexity of system identification, even in the most fundamental setting of estimating a discrete-time linear dynamical system (LDS) via ordinary least-squares regression (OLS). Specifically, we utilize asymptotic normality to identify classes of problem instances for which current bounds overstate the squared parameter error, in both spectral and Frobenius norm, by a factor of the state-dimension of the system. Informed by this discrepancy, we then sharpen the OLS parameter error bounds via a novel second-order decomposition of the parameter error, where crucially the lower-order term is a matrix-valued martingale that we show correctly captures the CLT scaling. From our analysis we obtain finite-sample bounds for both (i) stable systems and (ii) the many-trajectories setting that match the instance-specific optimal rates up to constant factors in Frobenius norm, and polylogarithmic state-dimension factors in spectral norm.

artificial intelligence, assumption 2, machine learning, (18 more...)

2604.2127

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.28)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Middle East > Jordan (0.04)
(2 more...)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)

Tinati, Mohammad, Tu, Stephen

On the Asymptotics of Self-Supervised Pre-training: Two-Stage M-Estimation and Representation Symmetry

arXiv.org Machine LearningMar-31-2026

Self-supervised pre-training, where large corpora of unlabeled data are used to learn representations for downstream fine-tuning, has become a cornerstone of modern machine learning. While a growing body of theoretical work has begun to analyze this paradigm, existing bounds leave open the question of how sharp the current rates are, and whether they accurately capture the complex interaction between pre-training and fine-tuning. In this paper, we address this gap by developing an asymptotic theory of pre-training via two-stage M-estimation. A key challenge is that the pre-training estimator is often identifiable only up to a group symmetry, a feature common in representation learning that requires careful treatment. We address this issue using tools from Riemannian geometry to study the intrinsic parameters of the pre-training representation, which we link with the downstream predictor through a notion of orbit-invariance, precisely characterizing the limiting distribution of the downstream test risk. We apply our main result to several case studies, including spectral pre-training, factor models, and Gaussian mixture models, and obtain substantial improvements in problem-specific factors over prior art when applicable.

artificial intelligence, machine learning, pre, (18 more...)

2603.27631

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > California (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(2 more...)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.67)

Neural Information Processing SystemsMar-17-2026, 13:30:28 GMT

Alternating minimization for dictionary learning with random initialization

Our algorithm is a simple alternating minimization procedure that switches between $\ell_1$ minimization and gradient descent in alternate steps. Dictionary learning and specifically alternating minimization algorithms for dictionary learning are well studied both theoretically and empirically. However, in contrast to previous theoretical analyses for this problem, we replace a condition on the operator norm (that is, the largest magnitude singular value) of the true underlying dictionary $A^*$ with a condition on the matrix infinity norm (that is, the largest magnitude term). This not only allows us to get convergence rates for the error of the estimated dictionary measured in the matrix infinity norm, but also ensures that a random initialization will provably converge to the global optimum. Our guarantees are under a reasonable generative model that allows for dictionaries with growing operator norms, and can handle an arbitrary level of overcompleteness, while having sparsity that is information theoretically optimal. We also establish upper bounds on the sample complexity of our algorithm.

artificial intelligence, machine learning, proceedings, (8 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.60)